GATHERING DATA

This project analyses data from the @WeRateDogs Twitter account up to August 2017. The data are gathered programmatically with Python libraries from three sources: the first is the twitter-archive-enhanced dataset already supplied by Udacity, the second is scraped from a web URL and the third is queried through Twitter's API.

Querying the Twitter API to access the third dataset

import tweepy

consumer_key = "xxxxxxxxxxxxxxxxxxxxxxxxxxxx"
consumer_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
access_token = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
access_secret = "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)
api = tweepy.API(auth, wait_on_rate_limit=True)
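With the api object in place, each tweet can be requested by id and its JSON saved line by line. The following is only a sketch of that idea: the function name fetch_tweet_json and the output path are hypothetical, and the broad except clause is a simplification that records any tweet that can no longer be fetched (for example, a deleted one).

```python
import json


def fetch_tweet_json(api, tweet_ids, out_path):
    """Write one JSON object per line for every tweet that can be fetched.

    `api` is assumed to be a tweepy.API instance; the function name and
    `out_path` are illustrative and not part of the original notebook.
    """
    failed = []
    with open(out_path, "w", encoding="utf-8") as out_file:
        for tweet_id in tweet_ids:
            try:
                # tweet_mode="extended" asks for the untruncated tweet text
                status = api.get_status(tweet_id, tweet_mode="extended")
            except Exception as error:  # e.g. deleted or protected tweets
                failed.append((tweet_id, str(error)))
                continue
            # _json holds the raw payload, including retweet_count
            # and favorite_count
            json.dump(status._json, out_file)
            out_file.write("\n")
    return failed
```

The returned list of failed ids makes it easy to see how many tweets from the archive are no longer available.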

ASSESSING @WeRateDogs Twitter Datasets

The three gathered data sets will be assessed visually and programmatically for quality and tidiness issues.

ASSESSING THESE THREE FILES VISUALLY AND PROGRAMMATICALLY

After assessing the three datasets, the following quality and tidiness issues were discovered:

Quality Issues

Twitter Table

  1. Missing records in the in_reply_to_status_id and in_reply_to_user_id columns (78 instead of 2356)
  2. Missing records for retweeted_status_id, retweeted_user_id and retweeted_status_timestamp (181 instead of 2356)
  3. Erroneous datatype for timestamp (object instead of datetime)
  4. Erroneous datatype for floofer, puppo, pupper and doggo (object instead of category)
  5. Some dog names are recorded as 'a', 'an', 'o' and so on
  6. Missing dog names represented with 'None'
  7. Improper extraction of ratings from the text column in the Twitter table, which also causes the next issue
  8. Erroneous datatype for rating_numerator column (integer instead of float)

Predictions Table

  1. Some records in the p1, p2 and p3 columns are not dog breeds
  2. Some dog breed names are in lowercase, others start with uppercase

Tidiness Issues

  1. doggo, floofer, pupper and puppo are different stages of dog growth and should be in one column in the Twitter table
  2. retweet_count and favorite_count in the Twits table should be part of the Twitter table, together with the dog breed with the highest confidence level from the twit_predictions table

CLEANING

The identified quality and tidiness issues above will be cleaned.

  1. Missing records in the in_reply_to_status_id and in_reply_to_user_id columns (78 instead of 2356)
  2. Missing records for retweeted_status_id, retweeted_user_id and retweeted_status_timestamp (181 instead of 2356) and presence of non-original values or retweets

Define:

Too many records are missing in the aforementioned columns. They hold ids and cannot be filled with mean or min values. Besides, these columns are not relevant for my EDA since they are based on retweets and I need original ratings, so I will drop them, including the source and expanded url columns. More than 95% of the source values are an iPhone, and the expanded urls are just that, expanded urls, so I will drop all of them.

The presence of the retweeted_status_id and in_reply_to_status_id columns indicates that there are retweets. I will first check whether their rows hold values. Rows with values in these columns will be dropped first, before the columns themselves, to ensure only originals remain.

Code
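One way to carry out the step defined above, sketched on a few made-up rows. The frame name twitter_clean and the exact column list are assumptions; the real archive has more rows and columns.

```python
import pandas as pd

# Illustrative rows only, not taken from the archive
twitter_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "in_reply_to_status_id": [None, 10.0, None, None],
    "in_reply_to_user_id": [None, 20.0, None, None],
    "retweeted_status_id": [None, None, 30.0, None],
    "source": ["iphone"] * 4,
    "expanded_urls": ["u1", "u2", "u3", "u4"],
    "text": ["a", "b", "c", "d"],
})

# Keep only rows that are neither retweets nor replies
is_original = (
    twitter_clean["retweeted_status_id"].isna()
    & twitter_clean["in_reply_to_status_id"].isna()
)
twitter_clean = twitter_clean[is_original]

# Then drop the columns that are no longer needed
unwanted = [
    "in_reply_to_status_id",
    "in_reply_to_user_id",
    "retweeted_status_id",
    "source",
    "expanded_urls",
]
twitter_clean = twitter_clean.drop(columns=unwanted).reset_index(drop=True)
```

Filtering the rows before dropping the columns matters here, because the columns are the only way to tell originals from retweets and replies.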

Test

  1. Erroneous datatype for timestamp (object instead of datetime)
  2. Erroneous datatype for floofer, doggo, pupper and puppo columns (this will be done after these columns have been collapsed into one column)

Define:

The object datatype erroneously assigned to timestamp will be converted to datetime.

Code
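A minimal sketch of the conversion, assuming the timestamps are strings with a UTC offset; the two sample values are made up.

```python
import pandas as pd

# Illustrative values; the column starts out as object (string) dtype
twitter_clean = pd.DataFrame({
    "timestamp": ["2017-08-01 16:23:56 +0000", "2015-11-15 22:32:08 +0000"],
})

# Parse the strings into timezone-aware datetimes
twitter_clean["timestamp"] = pd.to_datetime(twitter_clean["timestamp"])
```

After the conversion the .dt accessor becomes available, which is what later makes grouping by month and year possible.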

Test

  1. Some dog names are recorded as 'a', 'an', 'o'...
  2. Missing dog names represented with 'None'

Define:

These two quality issues occur in the same column. 745 dog names are filled with None and 55 with the letter 'a', alongside others. These will not be removed. Instead, I will replace them with "dog".

Code
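A possible way to do the replacement, shown on made-up names. It rests on the assumption that real names are capitalised, so any all-lowercase value is treated as an extraction error together with the literal string 'None'.

```python
import pandas as pd

# Illustrative names only
twitter_clean = pd.DataFrame({
    "name": ["Phineas", "None", "a", "an", "Tilly", "o"],
})

# Assumption: valid names start with an uppercase letter, so
# all-lowercase values and the string "None" are not real names
invalid_name = (
    twitter_clean["name"].str.islower()
    | (twitter_clean["name"] == "None")
)
twitter_clean.loc[invalid_name, "name"] = "dog"
```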

Test

  1. Improper extraction of ratings from the text column in the Twitter table, which also causes the next issue
  2. Erroneous datatype for rating_numerator column (integer instead of float)

Define:

Some dogs have decimal ratings, but these were recorded as integers with the wrong value extracted, e.g. 5 instead of 13.5. See below.

Code
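One way to re-extract the numerators so that decimals survive is a regular expression with an optional fractional part. This is a sketch on made-up tweet texts; the pattern is an assumption and only picks up the first rating it finds in each text.

```python
import pandas as pd

# Illustrative texts; the first one mimics a decimal rating that was
# stored as 5 because only the digits after the point were captured
twitter_clean = pd.DataFrame({
    "text": [
        "Such a happy smile on this one. 13.5/10",
        "Would absolutely pet. 12/10",
    ],
    "rating_numerator": [5, 12],
})

# Digits, an optional decimal part, then "/" and the denominator
pattern = r"(\d+(?:\.\d+)?)/\d+"
extracted = twitter_clean["text"].str.extract(pattern, expand=False)
twitter_clean["rating_numerator"] = extracted.astype(float)
```

Storing the column as float is what allows 13.5 to be kept instead of being cut down to an integer.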

Test

Looking at the rating numerators and denominators, some values are more than 10. In some cases where the denominator is greater than 10, multiple dogs are present in the image and the rating reflects that. The fact that these ratings are greater than 10 does not need to be cleaned. This unique rating system is a big part of the popularity of WeRateDogs, and these values (numerator and denominator) match the ratings found in the text.

These odd values can really skew the data and affect the analysis. Let's look at the rating numerators and denominators that have large values.

After going through each of these jpg_urls and their texts, it is evident that these seemingly odd ratings are based on the number of dogs in the picture posted on the @WeRateDogs Twitter account. For example, rating 204/170 had 17 dogs in the picture, which means each dog gets an individual rating of 12/10; 80/80 had 8 dogs in the picture, so each dog has a rating of 10/10. I will replace these group ratings with the individual ones to achieve a uniform dataset.

One picture, though, is of Snoop Dogg (420/10), one isn't a rating (24/7) and one rating was tremendously high (1776/10). I will drop these to get a workable dataset.

Code
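The arithmetic described above can be sketched as follows, on made-up rows that reuse the ratings mentioned in the text. It assumes every group denominator is a multiple of 10, so that denominator / 10 gives the number of dogs.

```python
import pandas as pd

# Illustrative rows built from the ratings discussed above
twitter_clean = pd.DataFrame({
    "rating_numerator": [204.0, 80.0, 13.0, 420.0, 24.0, 1776.0],
    "rating_denominator": [170, 80, 10, 10, 7, 10],
})

# Drop the three rows singled out above: 420/10, 24/7 and 1776/10
outliers = [(420.0, 10), (24.0, 7), (1776.0, 10)]
pairs = zip(twitter_clean["rating_numerator"],
            twitter_clean["rating_denominator"])
keep = [pair not in outliers for pair in pairs]
twitter_clean = twitter_clean[keep].reset_index(drop=True)

# Assumption: a denominator of 170 means 17 dogs, so dividing the
# numerator by denominator / 10 gives the rating per dog out of 10
dog_count = twitter_clean["rating_denominator"] / 10
twitter_clean["rating_numerator"] = (
    twitter_clean["rating_numerator"] / dog_count
)
twitter_clean["rating_denominator"] = 10
```

With these rows, 204/170 becomes 12/10 and 80/80 becomes 10/10, while the ordinary 13/10 rating is left unchanged.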

Test

  1. Some records in the p1, p2 and p3 columns are not dog breeds

Define:

@WeRateDogs only rates dogs, not fruit or household items. Any value other than a dog breed name will be removed from the p1, p2 and p3 columns in the twit_predictions_clean table. The p1_dog, p2_dog and p3_dog columns will be used for this: all rows where the flag is False will be removed.

Code
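A minimal sketch of the filter for the first prediction, on made-up rows. It relies on p1_dog being a boolean column.

```python
import pandas as pd

# Illustrative predictions only
twit_predictions_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "p1": ["golden_retriever", "banana", "pug"],
    "p1_dog": [True, False, True],
})

# Keep only the rows whose first prediction is a dog breed
twit_predictions_clean = (
    twit_predictions_clean[twit_predictions_clean["p1_dog"]]
    .reset_index(drop=True)
)
```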

Test

Replicate for p2_dog and p3_dog columns

  1. Some dog breed names are in lowercase, others start with uppercase

Define:

Some of the dog breed names start with uppercase while others start with lowercase. This is inconsistent. These will be converted to lowercase.

Code
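A short sketch of the conversion, applied to all three prediction columns of a made-up frame.

```python
import pandas as pd

# Illustrative breed names with mixed capitalisation
twit_predictions_clean = pd.DataFrame({
    "p1": ["Pembroke", "golden_retriever"],
    "p2": ["Cardigan", "Labrador_retriever"],
    "p3": ["Chihuahua", "kuvasz"],
})

# Lowercase every prediction column so breed names compare equal
for column in ["p1", "p2", "p3"]:
    twit_predictions_clean[column] = (
        twit_predictions_clean[column].str.lower()
    )
```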

Test

  1. doggo, floofer, pupper and puppo are different stages of dog growth and should be in one column in the Twitter_clean table

Define:

doggo, floofer, pupper and puppo are values of one variable and should be in one column of category type. To achieve this, I will extract these categories from the text column into a new column labelled dog_stage, so that all the dog stages are in one column, then drop the floofer, puppo, doggo and pupper columns.

Code

Some dogs are classified as both doggo and pupper. This will be considered while merging the doggo, puppo, pupper and floofer columns into a single column
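One way to extract the stages while keeping dogs that carry more than one, sketched on made-up texts. Joining multiple stages into a single label such as "doggo, pupper" is an assumption about how such rows should be represented.

```python
import pandas as pd

# Illustrative rows only
twitter_clean = pd.DataFrame({
    "text": [
        "Here is a very happy pupper. 12/10",
        "This doggo met a pupper today. 13/10",
        "No stage mentioned here. 11/10",
    ],
    "doggo": ["None", "doggo", "None"],
    "floofer": ["None", "None", "None"],
    "pupper": ["pupper", "pupper", "None"],
    "puppo": ["None", "None", "None"],
})

# Find every stage word mentioned in the text
stages = twitter_clean["text"].str.lower().str.findall(
    r"\b(doggo|floofer|pupper|puppo)\b"
)

# Join distinct stages in a fixed order; no match becomes a missing value
twitter_clean["dog_stage"] = stages.apply(
    lambda found: ", ".join(sorted(set(found))) or None
)
twitter_clean["dog_stage"] = twitter_clean["dog_stage"].astype("category")

# The four separate columns are now redundant
twitter_clean = twitter_clean.drop(
    columns=["doggo", "floofer", "pupper", "puppo"]
)
```

Converting to category at this point also resolves the earlier datatype issue for these columns.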

Test

  1. retweet_count and favorite_count in the Twits table should be part of the Twitter table, together with the dog breed with the highest confidence level from the twit_predictions table. p1 is the algorithm's #1 prediction for the image in the tweet and p1_conf is how confident the algorithm is in that prediction, so I will only work with these columns and drop the rest.

Define:

To achieve this, I will split the twit_predictions_clean dataframe into 3 sub-dataframes, rename the p1, p2 and p3 columns to dog_breed and the p1_conf, p2_conf and p3_conf columns to p_conf, and concatenate the three sub-dataframes into one.

Code
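The split, rename and concatenate steps can be sketched like this, on made-up predictions. The final step, keeping the single most confident row per tweet, is an assumption about how the stacked frame is reduced; the names stacked and best_predictions are hypothetical.

```python
import pandas as pd

# Illustrative predictions only
twit_predictions_clean = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["pug", "chow"], "p1_conf": [0.7, 0.5],
    "p2": ["beagle", "pug"], "p2_conf": [0.2, 0.3],
    "p3": ["chow", "collie"], "p3_conf": [0.1, 0.1],
})

# Build one sub-dataframe per prediction rank with shared column names
parts = []
for rank in ("p1", "p2", "p3"):
    part = twit_predictions_clean[["tweet_id", rank, f"{rank}_conf"]]
    part = part.rename(
        columns={rank: "dog_breed", f"{rank}_conf": "p_conf"}
    )
    parts.append(part)

# Stack the three sub-dataframes on top of each other
stacked = pd.concat(parts, ignore_index=True)

# Keep the most confident prediction for each tweet
best_predictions = (
    stacked.sort_values("p_conf", ascending=False)
    .drop_duplicates("tweet_id")
    .sort_values("tweet_id")
    .reset_index(drop=True)
)
```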

Renaming id in the twits_clean table to tweet_id for uniformity

Merging all three datasets
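A sketch of the rename and merge on three tiny made-up frames. The inner join is an assumption: it keeps only tweets that appear in all three tables, and the name best_predictions is hypothetical.

```python
import pandas as pd

# Illustrative frames standing in for the three cleaned datasets
twitter_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "rating_numerator": [12.0, 13.0, 10.0],
})
twits_clean = pd.DataFrame({
    "id": [1, 2, 3],
    "retweet_count": [5, 7, 1],
    "favorite_count": [50, 70, 9],
})
best_predictions = pd.DataFrame({
    "tweet_id": [1, 2],
    "dog_breed": ["pug", "chow"],
    "p_conf": [0.7, 0.5],
})

# Give the key column the same name in every table
twits_clean = twits_clean.rename(columns={"id": "tweet_id"})

# Inner joins keep only tweets present in all three tables
master = (
    twitter_clean
    .merge(twits_clean, on="tweet_id", how="inner")
    .merge(best_predictions, on="tweet_id", how="inner")
)
```

The merged frame can then be written out with master.to_csv("twitter_archive_master.csv", index=False).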

Test

STORING DATA

ANALYZING AND VISUALIZING @WeRateDogs DATASET

@WeRateDogs is a Twitter account that rates people's dogs with a unique and peculiar rating system. The @dogrates data, collected from different sources and merged into the "twitter_archive_master.csv" dataset, will be analyzed for insights into this rating system. The ratings are the dependent variable while the favorite count (likes), retweets and others are the independent variables. The following questions are used to get insights into the rating system:

  1. Which dog breeds are most popular or had the most ratings?
  2. Which dog stage is most popular or had the most ratings?
  3. Do dog breeds with higher ratings have the most likes and retweets?
  4. Is there a relationship between ratings, likes and retweets?
  5. Do more likes guarantee more retweets and vice versa?
  6. What is the period of analyses?

Assessing the Twitter Archive Master Dataset

ANALYZING AND VISUALIZING THE TWITTER ARCHIVE MASTER DATASET

  1. Which dog breed is most popular or had the most ratings?

To start off, I will look at the distribution of the ratings

Most of the ratings are between 10 and 13

  1. Which dog stage is most popular or had the highest number of ratings?

To prevent None from being listed as a dog stage in the dog stage column, I will replace None with NaN.

  1. Do dogs with higher ratings have more likes and retweets?

To answer this question, I will use the cut function to segment and sort rating_numerator values into bins in a new column and visualize using matplotlib.
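The binning step can be sketched as follows. The rows, the bin edges and the labels are all made up for illustration; the real bins may differ.

```python
import pandas as pd

# Illustrative rows only
master = pd.DataFrame({
    "rating_numerator": [3.0, 8.0, 10.0, 12.0, 13.0, 14.0],
    "retweet_count": [100, 300, 900, 2000, 4000, 9000],
    "favorite_count": [400, 1200, 3000, 9000, 15000, 30000],
})

# Hypothetical bin edges and labels
bin_edges = [0, 5, 10, 15]
bin_labels = ["0-5", "6-10", "11-15"]

# Assign every rating to a bin in a new column
master["rating_bin"] = pd.cut(
    master["rating_numerator"],
    bins=bin_edges,
    labels=bin_labels,
    include_lowest=True,
)

# Average retweets and likes per rating bin
mean_by_bin = (
    master.groupby("rating_bin", observed=True)
    [["retweet_count", "favorite_count"]]
    .mean()
)
```

Calling mean_by_bin.plot(kind="bar") on the result draws the bar chart with matplotlib.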

From the bar charts above, on average dogs with high ratings do have a high number of retweets and likes (favorite count).

  1. Is there any relationship between ratings, retweets and likes (favorite count)?

From the two graphs above, beyond a rating of about 14, higher ratings do not have significantly more likes and retweets. In fact, ratings above 14 seem to have similar like and retweet counts to zero ratings.

  1. Any relationship between favorite count (likes) and retweet count?
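One simple way to quantify this relationship is the Pearson correlation coefficient between the two columns. The sketch below uses made-up counts, so the resulting value says nothing about the real dataset.

```python
import pandas as pd

# Illustrative counts only
master = pd.DataFrame({
    "favorite_count": [400, 1200, 3000, 9000, 15000, 30000],
    "retweet_count": [100, 300, 900, 2000, 4000, 9000],
})

# Series.corr uses the Pearson coefficient by default
likes_vs_retweets = master["favorite_count"].corr(master["retweet_count"])
```

A value close to 1 indicates that tweets with more likes also tend to have more retweets, without implying that one causes the other.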

Since I cannot get a clear relationship between ratings, retweets and favorite counts, I will look at the period of retweets and favorite counts and see if time is an influencing factor on @WeRateDogs dog ratings

Now which month and year had the highest number of retweets and favorite count?
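A possible way to answer this is to group the tweets by calendar month and sum the counts. The rows below are made up, so the months they produce only illustrate the mechanics.

```python
import pandas as pd

# Illustrative rows only
master = pd.DataFrame({
    "timestamp": pd.to_datetime(
        ["2015-12-01", "2015-12-20", "2016-06-18", "2017-08-01"]
    ),
    "retweet_count": [300, 400, 600, 500],
    "favorite_count": [1000, 1200, 2000, 5000],
})

# Convert each timestamp to its year-month period
month = master["timestamp"].dt.to_period("M")

# Total retweets and likes per month, plus the number of postings
monthly = master.groupby(month)[["retweet_count", "favorite_count"]].sum()
posts_per_month = master.groupby(month).size()

# Months with the highest totals
top_retweet_month = monthly["retweet_count"].idxmax()
top_favorite_month = monthly["favorite_count"].idxmax()
```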

The @WeRateDogs Twitter account had the highest number of dog postings in December 2015, with 371 people posting their dogs. It saw the highest favorite_count activity in August 2017, with 1,449,849 likes, and the highest retweet_count activity in December 2015, with 460,727 retweets. The highest likes and retweets received for a single dog posting, however, were in June 2016, with 144,904 likes and 70,751 retweets. Let's look at the dog breeds and their ratings during these periods.

From the evaluations above, dog ratings aren't influenced by number of postings or how busy the @dogrates twitter account was.

Observations

  1. Golden Retrievers were the most popular during the period of analysis, with the highest number of ratings
  2. Labrador retrievers of the pupper stage were the most popular
  3. The most popular dog breeds did not have the highest number of likes and retweets
  4. Ratings of 14/10 and below were most popular
  5. There is a strong correlation between favorite count and retweets. This is not enough to infer causality, but dog breeds with a high favorite count have a fairly high number of retweets
  6. Some dogs had zero likes and zero ratings but a high number of retweets
  7. Ratings are not influenced by activity in the account or by time

Conclusion

  1. Dogs are rated by the creators of the account based on pictures sent to @WeRateDogs. The rating is not influenced by the number of likes, retweets, time or how busy the account is during the period of analysis. In addition, postings of several dogs are rated cumulatively, i.e. if a picture of 5 puppers is posted at one time, they are rated together, for example 50/50, meaning 10/10 per pupper.

LIMITATION(S)

  1. There were many missing values for the dog stage, so I could not determine which dog stage is actually the most popular. I had to use the top ten dog breeds to determine that
  2. Not all the dog breeds had a name or picture